Synopsis

We build a random forest classifier model on the Weight Lifting Exercises Dataset1 to predict how well an activity is done.

About the dataset:

Packages

Load the R packages needed further. It’s also good to set system locale to avoid problems related with system differences between regions.

library(data.table, quietly = T, warn.conflicts = F)
library(caret, quietly = T,warn.conflicts = F)
library(dplyr, quietly = T, warn.conflicts = F)
library(plotly, quietly = T, warn.conflicts = F)
library(doParallel, quietly = T, warn.conflicts = F)
library(ggpubr, quietly = T, warn.conflicts = F)
Sys.setlocale('LC_ALL','English')  
## [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

Getting data

Start by creating a directory to store data and download it from the web. While doing it, create a text file which states the time/timezone of download for reference purposes.

datadir <- './data'
trainingurl <- 'https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv'
testurl <- 'https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv'
trainingpath <- paste0(datadir,'/pml-training.csv')
testpath <- paste0(datadir,'/pml-testing.csv')

if(!dir.exists(datadir)){
        dir.create(datadir)
}

if(!file.exists(trainingpath)){
        
        download.file(url = trainingurl,
                      destfile = trainingpath,
                      method = 'curl')
        download.file(url = testurl,
                      destfile = testpath,
                      method = 'curl')
        
        time <- as.character(Sys.time())
        timezone <- Sys.timezone()
        
        downloadinfo <- data.frame(list(time = time, 
                             format = "%Y-%m-%d %H:%M:%S",
                             timezone = timezone))
        
        write.table(x = downloadinfo,
                    file = paste0(datadir,'/downloadinfo.txt'),
                    row.names = F)
}

Brief Exploratory Data Analysis

Use fread to load data into R environment.

training <- fread(file = trainingpath, data.table = F,stringsAsFactors = T)
test <- fread(file = testpath, data.table = F,stringsAsFactors = T)

There is lots of NA values in this dataset, so to avoid any trouble I’m gonna filter out columns that contains any of them and select the variables of interest to the model. Our response variable is named ‘classe’ in the dataset. For the purpose of this project, I’m gonna select only the spacial variables as explanatory variables. I expect that those variables contains enough information about the movements for the random forest’s trees to be able to group them to each class. Same thing is applied to test set.

training1 <- training %>% 
        select(!where(anyNA)) %>%
        select(classe,ends_with(c('x','y','z')))
test1 <- test %>% 
        select(!where(anyNA)) %>%
        select(problem_id,ends_with(c('x','y','z')))

Let’s take a look at a 3dplot of some spatial variable to see if we are lucky to find any pattern that calls attention.

training1 %>% plot_ly(
        x=.$gyros_belt_x,
        z = .$gyros_belt_z, 
        y = .$gyros_belt_y,
        type = 'scatter3d',
        color = .$classe,
        mode = 'markers')

It seems like we were lucky here. We can see there is some coordinates where only the class E appears, same with class D (you can play around selecting which classes you wanna see at the legend box). We hope that the random forest will be able to find more of those patterns and make a good model of it.

Building model

Let’s get to it. We are gonna use the caret package to build a random forest model with the variables we selected previously.

Random forests is a modification of bagging. It takes a “committee” of low performance trees to make each one a prediction, then averages the results from them. This is especially good for variance of the model, because with the average, the noises of each tree can be “neutralized”. Bias remains the same though, since each tree is identically distributed2.

For out of sample error to be evaluated, one important aspect of the model is how data is gonna be sampled from the pool of data and ‘fed’ to the model. With K-fold cross-validation with replacement, k folds are made sampling from the pool of data, with replacement. The train function from caret will then evaluate a model with each fold then decide which one gives better out of sample performance. We use here 5-fold cross-validation.

This way we expect random forest to neutralize variance due to noises from each tree, and variance due to model bias will be taken care through the cross-validation evaluation.

Due to lake of computation power, number of trees is set to 200, even though this might hurt overall performance. We will check later how this might affect accuracy with a graph.

setControl <- trainControl(method = "cv",
number = 5,
allowParallel = TRUE)
model <- train(classe~.,
               data = training1,
               method = 'rf',
               ntree = 200,
               trControl = setControl)
model$finalModel
## 
## Call:
##  randomForest(x = x, y = y, ntree = 200, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 200
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 0.89%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 5572    2    0    5    1 0.001433692
## B   30 3749   18    0    0 0.012641559
## C    1   26 3393    1    1 0.008474576
## D    3    0   77 3133    3 0.025808458
## E    0    1    1    5 3600 0.001940671

That’s our final model, the estimated out of sample error is 0.89%, which is reasonably low. The tuning parameter used by the train function is number of randomly selected predictors (explanatory variables) to each tree. At the plot below, we can see how accuracy changed among each one of those parameter and, at the final model, in each cross-validation k-fold.

plot1<-ggplot(model) + theme_bw()
plot2<-ggplot(data = model$resample, aes(x = Resample, y = Accuracy)) + geom_point() + theme_bw()
ggarrange(plot1,plot2,nrow=2)

We can see accuracy falls as the number of randomly selected predictors grows. That way, 2 predictors was chosen as the final model. Among the k-folds, accuracy was kept at a high value.

Now, let’s take a look on how error rate changes as the number of trees grows. There were some worry that this would affect performance.

errors <- data.table(model$finalModel$err.rate, trees = 1:200)
errors2 <- melt(errors,7)
plot3<- ggplot(data = errors2, aes(x = trees, y = value))
plot3<- plot3 + geom_line() + theme_bw() + facet_wrap(.~variable)
plot3<- plot3 + labs(y = 'Error Rate',x = 'Trees')
plot3

We can see that around 100 trees, the error rate practically stabilizes around some value, for overall out of bag sample and each outcome class. So there is not much worries about that now.

Prediction

Now we are gonna use the model to predict the test set.

pred <- predict(model, test1)
cbind.data.frame(pred,test1$problem_id)
##    pred test1$problem_id
## 1     B                1
## 2     A                2
## 3     B                3
## 4     A                4
## 5     A                5
## 6     E                6
## 7     D                7
## 8     B                8
## 9     A                9
## 10    A               10
## 11    B               11
## 12    C               12
## 13    B               13
## 14    A               14
## 15    E               15
## 16    E               16
## 17    A               17
## 18    B               18
## 19    B               19
## 20    B               20

As we have seen previously, the estimated out of sample error is only 0.89%, so we estimated 100% of the test cases correctly.


  1. Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13) . Stuttgart, Germany: ACM SIGCHI, 2013. Read more: http://groupware.les.inf.puc-rio.br/har#dataset#ixzz6c6x1JE00↩︎

  2. The Elements Of Statistical Learning, 2nd Edition, 2009. Acces Here.↩︎